Simpson's Bias in NLP Training
In most machine learning tasks, we evaluate a model $M$ on a given data
population $S$ by measuring a population-level metric $F(S;M)$. Examples of
such evaluation metrics include precision/recall for (binary) recognition,
the F1 score for multi-class classification, and the BLEU metric for language
generation. On the other hand, the model is trained by optimizing a
sample-level loss $G(S_t;M)$ at each learning step $t$, where $S_t$ is a subset
of $S$ (a.k.a. the mini-batch). Popular choices of $G$ include cross-entropy
loss, the Dice loss, and sentence-level BLEU scores. A fundamental assumption
behind this paradigm is that the mean value of the sample-level loss $G$, if
averaged over all possible samples, should effectively represent the
population-level metric $F$ of the task, that is, $\mathbb{E}[G(S_t;M)] \approx F(S;M)$.
In this paper, we systematically investigate the above assumption in several
NLP tasks. We show, both theoretically and experimentally, that some popular
designs of the sample-level loss $G$ may be inconsistent with the true
population-level metric $F$ of the task, so that models trained to optimize the
former can be substantially sub-optimal with respect to the latter, a phenomenon
we call Simpson's bias due to its deep connections with the classic paradox
known as Simpson's reversal paradox in statistics and social sciences.
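To make the mismatch concrete, here is a minimal numeric sketch (ours, not from the paper) showing that the mean of a batch-level F1 can disagree sharply with the population-level F1 computed from the pooled counts; the data are invented for illustration:

```python
# Minimal sketch: averaging per-batch F1 vs. computing F1 on pooled counts.
# The (tp, fp, fn) counts below are invented purely for illustration.

def f1(tp, fp, fn):
    """F1 score from raw true-positive / false-positive / false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

batches = [(1, 0, 0),      # tiny batch: perfect predictions, per-batch F1 = 1.0
           (10, 40, 40)]   # large batch: noisy predictions, per-batch F1 = 0.2

# Population-level metric F: pool all counts, then compute F1 once.
pooled = tuple(sum(c) for c in zip(*batches))
print("population-level F1:", round(f1(*pooled), 3))  # ~0.216

# Mean of the sample-level loss G: compute F1 per batch, then average.
print("mean batch-level F1:", sum(f1(*b) for b in batches) / len(batches))  # 0.6
```

The gap arises because F1 is non-linear in the underlying counts, which is exactly the kind of inconsistency between sample-level and population-level quantities that the paper analyzes.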
GameEval: Evaluating LLMs on Conversational Games
The rapid advancements in large language models (LLMs) have presented
challenges in evaluating those models. Existing evaluation methods are either
reference-based or preference-based, which inevitably need human intervention
or introduce test bias caused by evaluator models. In this paper, we propose
GameEval, a novel approach to evaluating LLMs through goal-driven
conversational games, overcoming the limitations of previous methods. GameEval
treats LLMs as game players and assigns them distinct roles with specific goals
achieved by launching conversations of various forms, including discussion,
question answering, and voting. We design three unique games with cooperative
or adversarial objectives, accompanied by corresponding evaluation metrics, to
show how this new paradigm comprehensively evaluates model performance. Through
extensive experiments, we show that GameEval can effectively differentiate the
capabilities of various LLMs, providing a comprehensive assessment of their
integrated abilities to solve complex problems. Our public anonymous code is
available at https://github.com/GameEval/GameEval
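As a rough illustration of the goal-driven setup, the sketch below runs a multi-round conversation among LLM players, each prompted with its own role and goal. The `chat` callables, prompt wording, and post-hoc scoring step are our assumptions, not GameEval's actual interface:

```python
# Hedged sketch of a goal-driven conversational game loop; the player interface
# and prompts are hypothetical stand-ins, not GameEval's actual implementation.
from typing import Callable

def play_game(players: dict[str, Callable[[str], str]],
              goals: dict[str, str],
              rounds: int = 3) -> str:
    """Run a multi-round conversation; each player is an LLM wrapped as a callable."""
    transcript = ""
    for _ in range(rounds):
        for name, chat in players.items():
            prompt = (f"You are {name}. Your goal: {goals[name]}\n"
                      f"Conversation so far:\n{transcript}\n{name}:")
            transcript += f"{name}: {chat(prompt)}\n"
    # The transcript is scored afterwards against each player's
    # cooperative or adversarial goal.
    return transcript
```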
Learning to Program with Natural Language
Large Language Models (LLMs) have shown remarkable performance in various
basic natural language tasks, which raises hope for achieving Artificial
General Intelligence. To complete a complex task, however, we still need a
program for the task first and then ask LLMs to follow that program to generate
a specific solution. We propose using natural language as a new programming
language to describe task procedures, making them easily understandable to both
humans and LLMs. The LLM is capable of directly generating natural language
programs, but these programs may still contain factual errors or incomplete
steps. Therefore, we further propose the Learning to Program (LP) method
to ask LLMs themselves to learn the natural language program based on the
training dataset of the complex task first and then use the learned program to
guide the inference. Our experiments on the reasoning tasks of five different
reasoning types (8 datasets) demonstrate the effectiveness of our approach.
Further, our analysis experiment shows that the learned program can be directly
used to guide another LLM to improve its performance, which reveals a new
transfer learning paradigm.
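A minimal sketch of the two-phase idea, assuming a generic text-completion callable `llm` (a hypothetical stand-in, not the paper's code): first distill a natural-language program from solved training examples, then prepend it to guide inference:

```python
# Hedged sketch of the Learning to Program (LP) loop; prompts are illustrative.

def learn_program(llm, training_examples: list[str]) -> str:
    """Phase 1: ask the LLM to write a natural-language program from solved examples."""
    prompt = ("Study the solved examples below and write a step-by-step "
              "natural-language program for solving tasks of this type.\n\n"
              + "\n\n".join(training_examples))
    return llm(prompt)

def solve_with_program(llm, program: str, task: str) -> str:
    """Phase 2: guide inference on a new instance with the learned program."""
    return llm(f"Follow this program step by step:\n{program}\n\nTask: {task}\nSolution:")
```

Because the learned program is plain text, handing it to a different `llm` in the second phase is what enables the transfer effect described above.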
Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks
We present Unicoder, a universal language encoder that is insensitive to
different languages. Given an arbitrary NLP task, a model can be trained with
Unicoder using training data in one language and directly applied to inputs of
the same task in other languages. Compared to similar efforts such as
Multilingual BERT and XLM, three new cross-lingual pre-training tasks are
proposed, including cross-lingual word recovery, cross-lingual paraphrase
classification and cross-lingual masked language model. These tasks help
Unicoder learn the mappings among different languages from more perspectives.
We also find that doing fine-tuning on multiple languages together can bring
further improvement. Experiments are performed on two tasks: cross-lingual
natural language inference (XNLI) and cross-lingual question answering (XQA),
where XLM is our baseline. On XNLI, we obtain a 1.8% average accuracy
improvement (across 15 languages). On XQA, a new cross-lingual dataset built by
us, we obtain a 5.5% average accuracy improvement (on French and German).
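For intuition, here is a simplified sketch of one of the three tasks, the cross-lingual masked language model: tokens are masked in a concatenated bilingual pair, so the model can recover them using context from either language. The tokenization, separator token, and masking rate are illustrative assumptions, not Unicoder's exact recipe:

```python
# Simplified sketch of building a cross-lingual masked-LM training example.
import random

def xmlm_example(src_tokens, tgt_tokens, mask_rate=0.15, mask="[MASK]"):
    pair = src_tokens + ["[SEP]"] + tgt_tokens   # bilingual context in one input
    labels = [None] * len(pair)                  # None = position not predicted
    for i, tok in enumerate(pair):
        if tok != "[SEP]" and random.random() < mask_rate:
            labels[i] = tok    # model must recover the original token,
            pair[i] = mask     # possibly attending to the other language
    return pair, labels

tokens, labels = xmlm_example("the cat sleeps".split(), "le chat dort".split())
```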
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
Evaluating the general abilities of foundation models to tackle human-level
tasks is a vital aspect of their development and application in the pursuit of
Artificial General Intelligence (AGI). Traditional benchmarks, which rely on
artificial datasets, may not accurately represent human-level capabilities. In
this paper, we introduce AGIEval, a novel benchmark specifically designed to
assess foundation models in the context of human-centric standardized exams,
such as college entrance exams, law school admission tests, math competitions,
and lawyer qualification tests. We evaluate several state-of-the-art foundation
models, including GPT-4, ChatGPT, and Text-Davinci-003, using this benchmark.
Impressively, GPT-4 surpasses average human performance on SAT, LSAT, and math
competitions, attaining a 95% accuracy rate on the SAT Math test and a 92.5%
accuracy on the English test of the Chinese national college entrance exam.
This demonstrates the extraordinary performance of contemporary foundation
models. In contrast, we also find that GPT-4 is less proficient in tasks that
require complex reasoning or specific domain knowledge. Our comprehensive
analyses of model capabilities (understanding, knowledge, reasoning, and
calculation) reveal these models' strengths and limitations, providing valuable
insights into future directions for enhancing their general capabilities. By
concentrating on tasks pertinent to human cognition and decision-making, our
benchmark delivers a more meaningful and robust evaluation of foundation
models' performance in real-world scenarios. The data, code, and all model
outputs are released at https://github.com/microsoft/AGIEval.
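For readers who want to script this kind of measurement themselves, below is a hedged sketch of an exam-style accuracy harness for multiple-choice questions; the JSONL field names are our assumptions, not AGIEval's actual schema (the repository above defines the real format):

```python
# Hedged sketch of a multiple-choice accuracy harness; the field names
# ("question", "options", "answer") are assumed, not AGIEval's schema.
import json

def evaluate(model, path: str) -> float:
    """Score a model on exam questions stored one JSON object per line."""
    correct = total = 0
    with open(path) as f:
        for line in f:
            ex = json.loads(line)
            prompt = ex["question"] + "\n" + "\n".join(ex["options"])
            pred = model(prompt).strip()[:1]   # assume the answer letter comes first
            correct += pred == ex["answer"]
            total += 1
    return correct / total
```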
TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs
Artificial Intelligence (AI) has made incredible progress recently. On the
one hand, advanced foundation models like ChatGPT can offer powerful
conversation, in-context learning and code generation abilities on a broad
range of open-domain tasks. They can also generate high-level solution outlines
for domain-specific tasks based on the common sense knowledge they have
acquired. However, they still face difficulties with some specialized tasks
because they lack sufficient domain-specific data during pre-training, or
because they often make errors in their neural-network computations on tasks
that require accurate execution. On the other hand, there are also many existing models and
systems (symbolic-based or neural-based) that can do some domain-specific tasks
very well. However, due to their different implementations or working mechanisms,
they are not easily accessible to or compatible with foundation models. Therefore,
there is a clear and pressing need for a mechanism that can leverage foundation
models to propose task solution outlines and then automatically match some of
the sub-tasks in the outlines to the off-the-shelf models and systems with
special functionalities to complete them. Inspired by this, we introduce
TaskMatrix.AI as a new AI ecosystem that connects foundation models with
millions of APIs for task completion. Unlike most previous work that aimed to
improve a single AI model, TaskMatrix.AI focuses more on using existing
foundation models (as a brain-like central system) and APIs of other AI models
and systems (as sub-task solvers) to achieve diversified tasks in both digital
and physical domains. As a position paper, we will present our vision of how to
build such an ecosystem, explain each key component, and use study cases to
illustrate both the feasibility of this vision and the main challenges we need
to address next.
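To ground the vision, here is a small sketch of the outline-then-match loop: the foundation model drafts sub-tasks, each sub-task is matched to a registered API, and the specialized system executes it. The registry, prompts, and fallback behavior are hypothetical illustrations, not TaskMatrix.AI's interface:

```python
# Hypothetical sketch of the outline-then-match flow; not TaskMatrix.AI's API.

def complete_task(llm, api_registry: dict, task: str) -> list:
    # 1. The foundation model (the "brain") drafts a high-level solution outline.
    outline = llm(f"Break this task into sub-tasks, one per line:\n{task}")
    results = []
    for subtask in outline.splitlines():
        # 2. Match the sub-task to an off-the-shelf API by name.
        choice = llm(f"Pick the best API for '{subtask}' from: "
                     + ", ".join(api_registry)).strip()
        handler = api_registry.get(choice)
        # 3. Execute with the specialized model/system; fall back to the LLM.
        results.append(handler(subtask) if handler else llm(subtask))
    return results
```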